1 00:00:11,077 --> 00:00:14,258 - Okay we have a lot to cover today so let's get started. 2 00:00:14,258 --> 00:00:17,454 Today we'll be talking about Generative Models. 3 00:00:17,454 --> 00:00:20,484 And before we start, a few administrative details. 4 00:00:20,484 --> 00:00:23,522 So midterm grades will be released on Gradescope this week 5 00:00:23,522 --> 00:00:27,730 A reminder that A3 is due next Friday May 26th. 6 00:00:27,730 --> 00:00:32,709 The HyperQuest deadline for extra credit you can do this still until Sunday May 21st. 7 00:00:33,632 --> 00:00:37,799 And our poster session is June 6th from 12 to 3 P.M.. 8 00:00:40,812 --> 00:00:47,759 Okay so an overview of what we're going to talk about today we're going to switch gears a little bit and take a look at unsupervised learning today. 9 00:00:47,759 --> 00:00:54,103 And in particular we're going to talk about generative models which is a type of unsupervised learning. 10 00:00:54,103 --> 00:00:57,112 And we'll look at three types of generative models. 11 00:00:57,112 --> 00:01:01,174 So pixelRNNs and pixelCNNs variational autoencoders 12 00:01:01,174 --> 00:01:04,174 and Generative Adversarial networks. 13 00:01:05,571 --> 00:01:11,168 So so far in this class we've talked a lot about supervised learning and different kinds of supervised learning problems. 14 00:01:11,168 --> 00:01:16,078 So in the supervised learning set up we have our data X and then we have some labels Y. 15 00:01:16,078 --> 00:01:21,417 And our goal is to learn a function that's mapping from our data X to our labels Y. 16 00:01:21,417 --> 00:01:26,237 And these labels can take many different types of forms. 17 00:01:26,237 --> 00:01:34,934 So for example, we've looked at classification where our input is an image and we want to output Y, a class label for the category. 18 00:01:34,934 --> 00:01:44,093 We've talked about object detection where now our input is still an image but here we want to output the bounding boxes of instances of up to multiple dogs or cats. 19 00:01:46,138 --> 00:01:51,986 We've talked about semantic segmentation where here we have a label for every pixel the category that every pixel belongs to. 20 00:01:53,572 --> 00:01:58,961 And we've also talked about image captioning where here our label is now a sentence 21 00:01:58,961 --> 00:02:02,961 and so it's now in the form of natural language. 22 00:02:03,998 --> 00:02:15,661 So unsupervised learning in this set up, it's a type of learning where here we have unlabeled training data and our goal now is to learn some underlying hidden structure of the data. 23 00:02:15,661 --> 00:02:20,370 Right, so an example of this can be something like clustering which you guys might have seen before 24 00:02:20,370 --> 00:02:25,029 where here the goal is to find groups within the data that are similar through some type of metric. 25 00:02:25,029 --> 00:02:27,187 For example, K means clustering. 26 00:02:27,187 --> 00:02:32,871 Another example of an unsupervised learning task is a dimensionality reduction. 27 00:02:33,777 --> 00:02:38,939 So in this problem want to find axes along which our training data has the most variation, 28 00:02:38,939 --> 00:02:43,537 and so these axes are part of the underlying structure of the data. 29 00:02:43,537 --> 00:02:51,095 And then we can use this to reduce of dimensionality of the data such that the data has significant variation among each of the remaining dimensions. 
30 00:02:51,095 --> 00:02:57,842 Right, so in this example here we start off with data in three dimensions and we're going to find two axes of variation in this case 31 00:02:57,842 --> 00:03:01,259 and reduce our data by projecting it down to 2D. 32 00:03:04,205 --> 00:03:09,964 Another example of unsupervised learning is learning feature representations for data. 33 00:03:11,006 --> 00:03:17,209 We've seen how to do this in supervised ways before where we used a supervised loss, for example classification. 34 00:03:17,209 --> 00:03:21,617 Where we have the classification label. We have something like a Softmax loss 35 00:03:21,617 --> 00:03:29,869 And we can train a neural network where we can interpret activations, for example our FC7 layer, as some kind of feature representation for the data. 36 00:03:29,869 --> 00:03:35,742 And in an unsupervised setting, for example here autoencoders which we'll talk more about later, 37 00:03:35,742 --> 00:03:46,872 in this case our loss is now trying to reconstruct the input data so that, basically, we have a good reconstruction of our input data, and we use this to learn features. 38 00:03:46,872 --> 00:03:52,245 So we're learning a feature representation without using any additional external labels. 39 00:03:53,471 --> 00:03:59,585 And finally another example of unsupervised learning is density estimation where in this case we want to 40 00:03:59,585 --> 00:04:02,884 estimate the underlying distribution of our data. 41 00:04:02,884 --> 00:04:10,811 So for example in this top case over here, we have points in 1-D and we can try and fit a Gaussian to this density 42 00:04:10,811 --> 00:04:16,605 and in this bottom example over here it's 2D data and here again we're trying to estimate the density and 43 00:04:16,605 --> 00:04:24,239 we can model this density. We want to fit a model such that the density is higher where there's more points concentrated. 44 00:04:26,100 --> 00:04:35,990 And so to summarize the differences: in supervised learning, which we've looked at a lot so far, we use labeled data to learn a function mapping from X to Y 45 00:04:35,990 --> 00:04:44,124 and in unsupervised learning we use no labels and instead we try to learn some underlying hidden structure of the data, whether this is groupings, 46 00:04:44,124 --> 00:04:48,291 axes of variation or the underlying density estimation. 47 00:04:49,662 --> 00:04:54,113 And unsupervised learning is a huge and really exciting area of research, 48 00:04:54,113 --> 00:05:04,339 and some of the reasons are that training data is really cheap, it doesn't use labels so we're able to learn from a lot of data at one time and basically utilize a lot 49 00:05:04,339 --> 00:05:09,977 more data than if we required annotating or finding labels for data. 50 00:05:09,977 --> 00:05:17,823 And unsupervised learning is still a relatively unsolved research area by comparison. There's a lot of open problems in this, 51 00:05:17,823 --> 00:05:24,669 but it also holds the potential that if you're able to successfully learn and represent a lot of the underlying structure 52 00:05:24,669 --> 00:05:32,729 in the data then this also takes you a long way towards the Holy Grail of trying to understand the structure of the visual world. 53 00:05:35,026 --> 00:05:40,432 So that's a little bit of a high-level big picture view of unsupervised learning.
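(A small aside for the notes: the dimensionality-reduction example above can be sketched in a few lines. This is a minimal illustration with made-up data, not code from the lecture; it centers the data, finds the principal axes of variation with an SVD, and projects 3-D points down to the top two axes.)

```python
import numpy as np

# Toy data: 500 points in 3-D whose variation lies mostly in a 2-D plane.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([2.0, 1.0, 0.1])

# Center the data, then find the principal axes of variation with an SVD.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the top-2 axes: the 3-D data is reduced to 2-D.
X_2d = X_centered @ Vt[:2].T
print(X_2d.shape)  # (500, 2)
```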
54 00:05:40,432 --> 00:05:44,155 And today will focus more specifically on generative models 55 00:05:44,155 --> 00:05:52,933 which is a class of models for unsupervised learning where given training data our goal is to try and generate new samples from the same distribution. 56 00:05:52,933 --> 00:05:57,686 Right, so we have training data over here generated from some distribution P data 57 00:05:57,686 --> 00:06:04,955 and we want to learn a model, P model to generate samples from the same distribution 58 00:06:04,955 --> 00:06:09,854 and so we want to learn P model to be similar to P data. 59 00:06:09,854 --> 00:06:12,636 And generative models address density estimations. 60 00:06:12,636 --> 00:06:22,180 So this problem that we saw earlier of trying to estimate the underlying distribution of your training data which is a core problem in unsupervised learning. 61 00:06:22,180 --> 00:06:25,190 And we'll see that there's several flavors of this. 62 00:06:25,190 --> 00:06:33,353 We can use generative models to do explicit density estimation where we're going to explicitly define and solve for our P model 63 00:06:35,045 --> 00:06:37,610 or we can also do implicit density estimation 64 00:06:37,610 --> 00:06:45,035 where in this case we'll learn a model that can produce samples from P model without explicitly defining it. 65 00:06:47,700 --> 00:06:54,096 So, why do we care about generative models? Why is this a really interesting core problem in unsupervised learning? 66 00:06:54,096 --> 00:06:57,451 Well there's a lot of things that we can do with generative models. 67 00:06:57,451 --> 00:07:04,659 If we're able to create realistic samples from the data distributions that we want we can do really cool things with this, right? 68 00:07:04,659 --> 00:07:14,568 We can generate just beautiful samples to start with so on the left you can see a completely new samples of just generated by these generative models. 69 00:07:14,568 --> 00:07:21,042 Also in the center here generated samples of images we can also do tasks like super resolution, 70 00:07:21,042 --> 00:07:32,145 colorization so hallucinating or filling in these edges with generated ideas of colors and what the purse should look like. 71 00:07:32,145 --> 00:07:41,619 We can also use generative models of time series data for simulation and planning and so this will be useful in for reinforcement learning applications 72 00:07:41,619 --> 00:07:45,089 which we'll talk a bit more about reinforcement learning in a later lecture. 73 00:07:45,089 --> 00:07:50,261 And training generative models can also enable inference of latent representations. 74 00:07:50,261 --> 00:07:57,435 Learning latent features that can be useful as general features for downstream tasks. 75 00:07:59,059 --> 00:08:05,688 So if we look at types of generative models these can be organized into the taxonomy here 76 00:08:05,688 --> 00:08:13,180 where we have these two major branches that we talked about, explicit density models and implicit density models. 77 00:08:13,180 --> 00:08:19,062 And then we can also get down into many of these other sub categories. 78 00:08:19,062 --> 00:08:27,814 And well we can refer to this figure is adapted from a tutorial on GANs from Ian Goodfellow 79 00:08:27,814 --> 00:08:36,861 and so if you're interested in some of these different taxonomy and categorizations of generative models this is a good resource that you can take a look at. 
80 00:08:36,861 --> 00:08:45,645 But today we're going to discuss three of the most popular types of generative models that are in use and in research today. 81 00:08:45,645 --> 00:08:49,475 And so we'll talk first briefly about pixelRNNs and CNNs 82 00:08:49,475 --> 00:08:52,162 And then we'll talk about variational autoencoders. 83 00:08:52,162 --> 00:08:55,661 These are both types of explicit density models. 84 00:08:55,661 --> 00:08:57,494 One that's using a tractable density 85 00:08:57,494 --> 00:09:01,312 and another that's using an approximate density 86 00:09:01,312 --> 00:09:05,614 And then we'll talk about generative adversarial networks, 87 00:09:05,614 --> 00:09:09,781 GANs which are a type of implicit density estimation. 88 00:09:12,152 --> 00:09:16,304 So let's first talk about pixelRNNs and CNNs. 89 00:09:16,304 --> 00:09:20,015 So these are a type of fully visible belief networks 90 00:09:20,015 --> 00:09:22,432 which are modeling a density explicitly 91 00:09:22,432 --> 00:09:34,941 so in this case what they do is we have this image data X that we have and we want to model the probability or likelihood of this image P of X. Right and so in this case, for these kinds of models, 92 00:09:34,941 --> 00:09:40,384 we use the chain rule to decompose this likelihood into a product of one dimensional distribution. 93 00:09:40,384 --> 00:09:43,493 So we have here the probability of each pixel X I 94 00:09:43,493 --> 00:09:47,871 conditioned on all previous pixels X1 through XI - 1. 95 00:09:47,871 --> 00:09:58,073 and your likelihood all right, your joint likelihood of all the pixels in your image is going to be the product of all of these pixels together, all of these likelihoods together. 96 00:09:58,073 --> 00:10:08,938 And then once we define this likelihood, in order to train this model we can just maximize the likelihood of our training data under this defined density. 97 00:10:10,980 --> 00:10:20,833 So if we look at this this distribution over pixel values right, we have this P of XI given all the previous pixel values, well this is a really complex distribution. 98 00:10:20,833 --> 00:10:22,700 So how can we model this? 99 00:10:22,700 --> 00:10:29,042 Well we've seen before that if we want to have complex transformations we can do these using neural networks. 100 00:10:29,042 --> 00:10:32,828 Neural networks are a good way to express complex transformations. 101 00:10:32,828 --> 00:10:42,300 And so what we'll do is we'll use a neural network to express this complex function that we have of the distribution. 102 00:10:43,235 --> 00:10:44,796 And one thing you'll see here is that, 103 00:10:44,796 --> 00:10:51,212 okay even if we're going to use a neural network for this another thing we have to take care of is how do we order the pixels. 104 00:10:51,212 --> 00:10:58,886 Right, I said here that we have a distribution for P of XI given all previous pixels but what does all previous the pixels mean? 105 00:10:58,886 --> 00:11:01,303 So we'll take a look at that. 106 00:11:03,336 --> 00:11:06,669 So PixelRNN was a model proposed in 2016 107 00:11:07,595 --> 00:11:17,657 that basically defines a way for setting up and optimizing this problem and so how this model works is 108 00:11:17,657 --> 00:11:21,187 that we're going to generate pixels starting in a corner of the image. 
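(A note for the reader: the chain-rule factorization just described, in the standard notation where x_i is the i-th pixel under the chosen ordering, is

$$ p_\theta(x) \;=\; \prod_{i=1}^{n} p_\theta\!\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

and training maximizes $\sum_{x \in \text{training data}} \log p_\theta(x)$.)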
109 00:11:21,187 --> 00:11:31,050 So we can look at this grid as basically the pixels of your image and so what we're going to do is start from the pixel in the upper left-hand corner 110 00:11:31,050 --> 00:11:37,195 and then we're going to sequentially generate pixels based on these connections from the arrows that you can see here. 111 00:11:37,195 --> 00:11:44,332 And each of the dependencies on the previous pixels in this ordering is going to be modeled using an RNN, 112 00:11:44,332 --> 00:11:48,092 or more specifically an LSTM which we've seen before in lecture. 113 00:11:48,092 --> 00:11:55,242 Right so using this we can basically continue to move forward, just moving down along this diagonal 114 00:11:55,242 --> 00:12:01,244 and generating all of these pixel values dependent on the pixels that they're connected to. 115 00:12:01,244 --> 00:12:08,736 And so this works really well but the drawback here is this sequential generation, right, it's actually quite slow to do this. 116 00:12:08,736 --> 00:12:15,061 You can imagine, you know, if you're going to generate a new image, instead of the feed forward networks that we've seen with CNNs, 117 00:12:15,061 --> 00:12:20,952 here we're going to have to iteratively go through and generate all of these pixels. 118 00:12:24,044 --> 00:12:30,575 So a little bit later, after pixelRNN, another model called pixelCNN was introduced. 119 00:12:30,575 --> 00:12:34,570 And this has a very similar setup to pixelRNN, 120 00:12:34,570 --> 00:12:43,074 and we're still going to do this image generation starting from the corner of the image and expanding outwards, but the difference now is that instead of using 121 00:12:43,074 --> 00:12:47,752 an RNN to model all these dependencies we're going to use a CNN instead. 122 00:12:47,752 --> 00:12:52,179 And we're now going to use a CNN over a context region 123 00:12:52,179 --> 00:12:56,384 that you can see here around the particular pixel that we're going to generate now. 124 00:12:56,384 --> 00:13:09,313 Right so we take the pixels around it, this gray area within the region that's already been generated, and then we can pass this through a CNN and use that to generate our next pixel value. 125 00:13:11,041 --> 00:13:18,055 And so what this is going to give is a CNN, a neural network, at each pixel location, 126 00:13:18,055 --> 00:13:22,967 right, and so the output of this is going to be a softmax over the pixel values here. 127 00:13:22,967 --> 00:13:31,193 In this case we have values 0 to 255, and then we can train this by maximizing the likelihood of the training images. 128 00:13:31,193 --> 00:13:43,482 Right so we say that basically we want to take a training image, we're going to do this generation process, and at each pixel location we have the ground truth 129 00:13:43,482 --> 00:13:53,976 training data image value that we have here, and this is basically the label, the classification label, that we want our pixel to be, which of these 256 values, 130 00:13:53,976 --> 00:13:56,723 and we can train this using a Softmax loss. 131 00:13:56,723 --> 00:14:05,597 Right and so basically the effect of doing this is that we're going to maximize the likelihood of our training data pixels being generated. 132 00:14:05,597 --> 00:14:08,413 Okay any questions about this? Yes.
133 00:14:08,413 --> 00:14:12,159 [student's words obscured due to lack of microphone] 134 00:14:12,159 --> 00:14:18,675 Yeah, so the question is, I thought we were talking about unsupervised learning, why do we have basically a classification label here? 135 00:14:18,675 --> 00:14:24,970 The reason is that this loss, this output that we have, is the value of the input training data. 136 00:14:24,970 --> 00:14:26,983 So we have no external labels, right? 137 00:14:26,983 --> 00:14:38,533 We didn't go and have to manually collect any labels for this, we're just taking our input data and saying that this is what we use for the loss function. 138 00:14:41,199 --> 00:14:45,366 [student's words obscured due to lack of microphone] 139 00:14:47,998 --> 00:14:50,746 The question is, is this like bag of words? 140 00:14:50,746 --> 00:14:53,109 I would say it's not really bag of words, 141 00:14:53,109 --> 00:15:01,466 it's more saying that we're outputting a distribution over pixel values at each location of our image, right, and what we want to do 142 00:15:01,466 --> 00:15:10,442 is we want to maximize the likelihood of our input, our training data, being produced, being generated. 143 00:15:10,442 --> 00:15:15,761 Right so, in that sense, this is why it's using our input data to create our loss. 144 00:15:21,006 --> 00:15:24,904 So with pixelCNN, training is faster than with pixelRNN 145 00:15:24,904 --> 00:15:34,301 because here now, right, at every pixel location we want to maximize the likelihood of our training data 146 00:15:34,301 --> 00:15:40,739 showing up, and so we have all of these values already, right, just from our training data, and so we can do this much 147 00:15:40,739 --> 00:15:47,296 faster, but at generation time, at test time, we want to generate a completely new image, right, just starting from 148 00:15:47,296 --> 00:15:59,197 the corner, and we're not trying to do any type of learning, so at generation time we still have to generate each of these pixel locations before we can generate the next location. 149 00:15:59,197 --> 00:16:03,025 And so generation here is still slow even though training time is faster. 150 00:16:03,025 --> 00:16:04,204 Question. 151 00:16:04,204 --> 00:16:08,365 [student's words obscured due to lack of microphone] 152 00:16:08,365 --> 00:16:14,077 So the question is, is the distribution this training learns sensitive to what you pick for the first pixel? 153 00:16:14,077 --> 00:16:21,208 Yeah, so it is dependent on what you have as the initial pixel distribution, and then everything is conditioned based on that. 154 00:16:23,203 --> 00:16:32,171 So again, how do you pick this distribution? So at training time you have these distributions from your training data, and then at generation time 155 00:16:32,171 --> 00:16:38,368 you can just initialize this with either a uniform distribution or from your training data, however you want. 156 00:16:38,368 --> 00:16:42,553 And then once you have that, everything else is conditioned based on that. 157 00:16:42,553 --> 00:16:43,912 Question. 158 00:16:43,912 --> 00:16:48,079 [student's words obscured due to lack of microphone] 159 00:17:07,415 --> 00:17:14,146 Yeah so the question is, is there a way to define this other than in this chain rule fashion, by instead predicting all the pixels at one time?
160 00:17:14,146 --> 00:17:17,884 And so we'll see, we'll see models later that do do this, 161 00:17:17,884 --> 00:17:27,868 but what the chain rule allows us to do is to define this very tractable density that we can then optimize, directly optimizing the likelihood. 162 00:17:31,864 --> 00:17:39,606 Okay so these are some examples of generations from this model and so here on the left you can see 163 00:17:39,606 --> 00:17:48,846 generations where the training data is CIFAR-10, the CIFAR-10 dataset. And so you can see that in general they are starting to capture statistics of natural images. 164 00:17:48,846 --> 00:17:56,848 You can see general types of blobs and kind of things that look like parts of natural images coming out. 165 00:17:56,848 --> 00:18:02,768 On the right here it's ImageNet, we can again see samples from here and these are starting to look like natural images 166 00:18:05,060 --> 00:18:09,966 but they're still not there, there's still room for improvement. 167 00:18:09,966 --> 00:18:17,059 You can still see that there are obviously differences with the original training images and some of the semantics are not clear in here. 168 00:18:19,371 --> 00:18:27,020 So, to summarize this, pixelRNNs and CNNs allow you to explicitly compute the likelihood P of X. 169 00:18:27,020 --> 00:18:29,297 It's an explicit density that we can optimize. 170 00:18:29,297 --> 00:18:34,043 And being able to do this also has another benefit of giving a good evaluation metric. 171 00:18:34,043 --> 00:18:40,958 You know, you can kind of measure how good your samples are by this likelihood of the data that you can compute. 172 00:18:40,958 --> 00:18:47,043 And it's able to produce pretty good samples, but it's still an active area of research 173 00:18:47,043 --> 00:18:53,760 and the main disadvantage of these methods is that the generation is sequential and so it can be pretty slow. 174 00:18:53,760 --> 00:18:59,324 And these kinds of methods have also been used for generating audio, for example. 175 00:18:59,324 --> 00:19:08,170 And you can look online for some pretty interesting examples of this, but again the drawback is that it takes a long time to generate these samples. 176 00:19:08,170 --> 00:19:14,565 And so there's been a lot of work since then on improving pixelCNN performance, 177 00:19:14,565 --> 00:19:22,346 so all kinds of different architecture changes, changes to the loss function, formulating this differently, and different types of training tricks. 178 00:19:22,346 --> 00:19:29,495 And so if you're interested in learning more about this you can look at some of these papers on PixelCNN 179 00:19:29,495 --> 00:19:35,115 and then PixelCNN++, an improved version that came out this year. 180 00:19:37,455 --> 00:19:44,321 Okay so now we're going to talk about another type of generative model called variational autoencoders. 181 00:19:44,321 --> 00:19:52,204 And so far we saw that pixelCNNs defined a tractable density function, right, using this definition, 182 00:19:52,204 --> 00:19:58,365 and based on that we can directly optimize the likelihood of the training data. 183 00:19:59,419 --> 00:20:04,195 So with variational autoencoders now we're going to define an intractable density function. 184 00:20:04,195 --> 00:20:10,769 We're now going to model this with an additional latent variable Z and we'll talk in more detail about how this looks.
185 00:20:10,769 --> 00:20:17,886 And so our data likelihood P of X is now basically has to be this integral right, 186 00:20:17,886 --> 00:20:21,422 taking the expectation over all possible values of Z. 187 00:20:21,422 --> 00:20:26,909 And so this now is going to be a problem. We'll see that we cannot optimize this directly. 188 00:20:26,909 --> 00:20:33,706 And so instead what we have to do is we have to derive and optimize a lower bound on the likelihood instead. 189 00:20:33,706 --> 00:20:34,956 Yeah, question. 190 00:20:35,864 --> 00:20:37,592 So the question is is what is Z? 191 00:20:37,592 --> 00:20:42,862 Z is a latent variable and I'll go through this in much more detail. 192 00:20:44,479 --> 00:20:48,538 So let's talk about some background first. 193 00:20:48,538 --> 00:20:54,733 Variational autoencoders are related to a type of unsupervised learning model called autoencoders. 194 00:20:54,733 --> 00:21:00,965 And so we'll talk little bit more first about autoencoders and what they are and then I'll explain how variational 195 00:21:00,965 --> 00:21:05,851 autoencoders are related and build off of this and allow you to generate data. 196 00:21:05,851 --> 00:21:09,168 So with autoencoders we don't use this to generate data, 197 00:21:09,168 --> 00:21:15,719 but it's an unsupervised approach for learning a lower dimensional feature representation from unlabeled training data. 198 00:21:15,719 --> 00:21:21,550 All right so in this case we have our input data X and then we're going to want to learn some features that we call Z. 199 00:21:22,541 --> 00:21:29,605 And then we'll have an encoder that's going to be a mapping, a function mapping from this input data to our feature Z. 200 00:21:30,911 --> 00:21:33,905 And this encoder can take many different forms right, 201 00:21:33,905 --> 00:21:41,239 they would generally use neural networks so originally these models have been around, autoencoders have been around for a long time. 202 00:21:41,239 --> 00:21:45,803 So in the 2000s we used linear layers of non-linearities, 203 00:21:45,803 --> 00:21:54,389 then later on we had fully connected deeper networks and then after that we moved on to using CNNs for these encoders. 204 00:21:55,385 --> 00:22:01,351 So we take our input data X and then we map this to some feature Z. 205 00:22:01,351 --> 00:22:11,817 And Z we usually have as, we usually specify this to be smaller than X and we perform basically dimensionality reduction because of that. 206 00:22:11,817 --> 00:22:17,729 So the question who has an idea of why do we want to do dimensionality reduction here? 207 00:22:17,729 --> 00:22:20,896 Why do we want Z to be smaller than X? 208 00:22:22,114 --> 00:22:25,497 Yeah. [student's words obscured due to lack of microphone] 209 00:22:25,497 --> 00:22:31,657 So the answer I heard is Z should represent the most important features in X and that's correct. 210 00:22:32,634 --> 00:22:41,758 So we want Z to be able to learn features that can capture meaningful factors of variation in the data. Right this makes them good features. 211 00:22:42,833 --> 00:22:46,717 So how can we learn this feature representation? 212 00:22:46,717 --> 00:22:55,944 Well the way autoencoders do this is that we train the model such that the features can be used to reconstruct our original data. 213 00:22:55,944 --> 00:23:03,730 So what we want is we want to have input data that we use an encoder to map it to some lower dimensional features Z. 
214 00:23:05,320 --> 00:23:06,926 This is the output of the encoder network, 215 00:23:06,926 --> 00:23:16,554 and we want to be able to take these features that were produced based on this input data and then use a decoder a second network and be able to output now something 216 00:23:16,554 --> 00:23:24,865 of the same size dimensionality as X and have it be similar to X right so we want to be able to reconstruct the original data. 217 00:23:26,387 --> 00:23:38,583 And again for the decoder we are basically using same types of networks as encoders so it's usually a little bit symmetric and now we can use CNN networks for most of these. 218 00:23:41,675 --> 00:23:48,720 Okay so the process is going to be we're going to take our input data right we pass it through our encoder first 219 00:23:48,720 --> 00:23:53,996 which is going to be something for example like a four layer convolutional network and then we're going to pass it, 220 00:23:53,996 --> 00:24:04,196 get these features and then we're going to pass it through a decoder which is a four layer for example upconvolutional network and then get a reconstructed data out at the end of this. 221 00:24:04,196 --> 00:24:14,409 Right in the reason why we have a convolutional network for the encoder and an upconvolutional network for the decoder is because at the encoder we're basically 222 00:24:14,409 --> 00:24:25,893 taking it from this high dimensional input to these lower dimensional features and now we want to go the other way go from our low dimensional features back out to our high dimensional reconstructed input. 223 00:24:28,906 --> 00:24:39,071 And so in order to get this effect that we said we wanted before of being able to reconstruct our input data we'll use something like an L2 loss function. 224 00:24:39,071 --> 00:24:49,306 Right that basically just says let me make my pixels of my input data to be the same as my, my pixels in my reconstructed data to be the same as the pixels of my input data. 225 00:24:51,078 --> 00:24:58,599 An important thing to notice here, this relates back to a question that we had earlier, is that even though we have this loss function here, 226 00:24:58,599 --> 00:25:02,515 there's no, there's no external labels that are being used in training this. 227 00:25:02,515 --> 00:25:10,861 All we have is our training data that we're going to use both to pass through the network as well as to compute our loss function. 228 00:25:13,346 --> 00:25:19,021 So once we have this after training this model what we can do is we can throw away this decoder. 229 00:25:19,021 --> 00:25:26,108 All this was used was too to be able to produce our reconstruction input and be able to compute our loss function. 230 00:25:26,108 --> 00:25:34,819 And we can use the encoder that we have which produces our feature mapping and we can use this to initialize a supervised model. 231 00:25:34,819 --> 00:25:45,773 Right and so for example we can now go from this input to our features and then have an additional classifier network on top of this that now we can use to output 232 00:25:45,773 --> 00:25:55,601 a class label for example for classification problem we can have external labels from here and use our standard loss functions like Softmax. 233 00:25:55,601 --> 00:26:04,449 And so the value of this is that we basically were able to use a lot of unlabeled training data to try and learn good general feature representations. 
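(A minimal code sketch of the autoencoder pipeline just described, for the notes. The architecture and layer sizes here are illustrative assumptions, not the exact networks from the lecture; the point is the structure: a convolutional encoder mapping the image to a small feature z, an upconvolutional decoder mapping z back to the input size, and an L2 reconstruction loss that uses no external labels.)

```python
import torch
import torch.nn as nn

# Minimal convolutional autoencoder sketch (layer sizes are illustrative).
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: high-dimensional image -> low-dimensional features z.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 64),                             # z has 64 dimensions
        )
        # Decoder: features z -> reconstruction of the same size as the input.
        self.decoder = nn.Sequential(
            nn.Linear(64, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
x = torch.rand(8, 1, 28, 28)          # a batch of unlabeled images
x_hat = model(x)
loss = ((x_hat - x) ** 2).mean()      # L2 reconstruction loss, no external labels
loss.backward()
```

(After training, the decoder can be discarded and the encoder kept as an initialization for a supervised model, as described above.)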
234 00:26:04,449 --> 00:26:12,363 Right, and now we can use this to initialize a supervised learning problem where sometimes we don't have so much data, we only have a small amount of data. 235 00:26:12,363 --> 00:26:19,697 And we've seen in previous homeworks and classes that with small data it's hard to learn a model, right? 236 00:26:19,697 --> 00:26:22,563 You can have overfitting and all kinds of problems, 237 00:26:22,563 --> 00:26:27,540 and so this allows you to initialize your model first with better features. 238 00:26:31,371 --> 00:26:42,329 Okay so we saw that autoencoders are able to reconstruct data and are able to, as a result, learn features that we can use to initialize a supervised model. 239 00:26:42,329 --> 00:26:50,133 And we saw that these features that we learned have this intuition of being able to capture factors of variation in the training data. 240 00:26:50,133 --> 00:26:58,941 All right, so based on this intuition, we can have this latent vector Z which captures factors of variation in our training data. 241 00:26:58,941 --> 00:27:04,957 Now a natural question is, well, can we use a similar type of setup to generate new images? 242 00:27:06,922 --> 00:27:09,502 And so now we will talk about variational autoencoders, 243 00:27:09,502 --> 00:27:15,987 which is a probabilistic spin on autoencoders that will let us sample from the model in order to generate new data. 244 00:27:15,987 --> 00:27:19,404 Okay, any questions on autoencoders first? 245 00:27:20,796 --> 00:27:22,828 Okay, so variational autoencoders. 246 00:27:22,828 --> 00:27:28,914 All right, so here we assume that our training data, X I for I from one to N, 247 00:27:30,255 --> 00:27:34,812 is generated from some underlying, unobserved latent representation Z. 248 00:27:34,812 --> 00:27:38,357 Right, so it's this intuition that Z is some vector 249 00:27:38,357 --> 00:27:47,069 where each element of Z is capturing how little or how much of some factor of variation we have in our training data. 250 00:27:48,491 --> 00:27:54,811 Right so the intuition is, you know, maybe these could be something like different kinds of attributes. Let's say we're trying to generate faces, 251 00:27:54,811 --> 00:28:02,608 it could be how much of a smile is on the face, it could be the position of the eyebrows, hair, orientation of the head. 252 00:28:02,608 --> 00:28:08,772 These are all possible types of latent factors that could be learned. 253 00:28:08,772 --> 00:28:13,901 Right, and so our generation process is that we're going to sample from a prior over Z. 254 00:28:13,901 --> 00:28:25,014 Right so for each of these attributes, for example, you know, how much smile there is, we can have a prior over what sort of distribution we think there should be for this, so 255 00:28:25,014 --> 00:28:31,571 a Gaussian is something that's a natural prior that we can use for each of these factors of Z, 256 00:28:31,571 --> 00:28:40,140 and then we're going to generate our data X by sampling from a conditional distribution P of X given Z. 257 00:28:40,140 --> 00:28:48,862 So we sample Z first, we sample a value for each of these latent factors, and then we'll use that and sample our image X from here. 258 00:28:51,409 --> 00:28:57,667 And so the true parameters of this generation process are theta, theta star, right?
259 00:28:57,667 --> 00:29:03,158 So we have the parameters of our prior and our conditional distributions, 260 00:29:03,158 --> 00:29:11,727 and what we want to do, in order to have a generative model that's able to generate new data, is to estimate these true parameters. 261 00:29:14,790 --> 00:29:17,611 Okay so let's first talk about how we should represent this model. 262 00:29:20,282 --> 00:29:27,317 All right, so if we're going to have a model for this generative process, well we've already said before that we can choose our prior P of Z to be something simple. 263 00:29:27,317 --> 00:29:32,713 Something like a Gaussian, right? And this is a reasonable thing to choose for latent attributes. 264 00:29:35,696 --> 00:29:40,840 Now for our conditional distribution P of X given Z, this is much more complex, right, 265 00:29:40,840 --> 00:29:43,410 because we need to use this to generate an image, 266 00:29:43,410 --> 00:29:53,062 and so for P of X given Z, well as we saw before, when we have some type of complex function that we want to represent, we can represent this with a neural network. 267 00:29:53,062 --> 00:29:58,259 And so that's a natural choice: let's try and model P of X given Z with a neural network. 268 00:30:00,308 --> 00:30:02,345 And we're going to call this the decoder network. 269 00:30:02,345 --> 00:30:10,167 Right, so we're going to think about taking some latent representation and trying to decode this into the image that it's specifying. 270 00:30:10,167 --> 00:30:13,765 So now how can we train this model? 271 00:30:13,765 --> 00:30:19,419 Right, we want to be able to train this model so that we can learn an estimate of these parameters. 272 00:30:19,419 --> 00:30:26,668 So if we remember our strategy for training generative models, back from our fully visible belief networks, our pixelRNNs and CNNs, 273 00:30:28,577 --> 00:30:35,498 a straightforward natural strategy is to try and learn these model parameters in order to maximize the likelihood of the training data. 274 00:30:35,498 --> 00:30:39,346 Right, so we saw earlier that in this case, with our latent variable Z, we're going to have 275 00:30:39,346 --> 00:30:49,884 to write out P of X taking the expectation over all possible values of Z, which is continuous, and so we get this expression here. Right so now we have it with this latent Z, 276 00:30:49,884 --> 00:30:55,759 and now if we want to try and maximize this likelihood, well what's the problem? 277 00:30:55,759 --> 00:31:01,372 Can we just take gradients and maximize this likelihood? 278 00:31:01,372 --> 00:31:04,358 [student's words obscured due to lack of microphone] 279 00:31:04,358 --> 00:31:08,524 Right, so this integral is not going to be tractable, that's correct. 280 00:31:10,199 --> 00:31:12,547 So let's take a look at this in a little bit more detail. 281 00:31:12,547 --> 00:31:18,772 Right, so we have our data likelihood term here. And the first term is P of Z. 282 00:31:18,772 --> 00:31:24,847 And here we already said earlier, we can just choose this to be a simple Gaussian prior, so this is fine. 283 00:31:24,847 --> 00:31:29,031 P of X given Z, well we said we were going to specify a decoder neural network. 284 00:31:29,031 --> 00:31:32,774 So given any Z, we can get P of X given Z from here. 285 00:31:32,774 --> 00:31:35,721 It's the output of our neural network. 286 00:31:35,721 --> 00:31:38,147 But then what's the problem here?
287 00:31:38,147 --> 00:31:48,435 Okay this was supposed to be a different unhappy face but somehow I don't know what happened, in the process of translation, it turned into a crying black ghost 288 00:31:49,298 --> 00:31:58,591 but what this is symbolizing is that basically if we want to compute P of X given Z for every Z this is now intractable right, 289 00:31:59,519 --> 00:32:02,186 we cannot compute this integral. 290 00:32:04,794 --> 00:32:06,591 So data likelihood is intractable 291 00:32:06,591 --> 00:32:19,639 and it turns out that if we look at other terms in this model if we look at our posterior density, So P of our posterior of Z given X, then this is going to be P of X given Z 292 00:32:19,639 --> 00:32:23,712 times P of Z over P of X by Bayes' rule 293 00:32:23,712 --> 00:32:25,740 and this is also going to be intractable, right. 294 00:32:25,740 --> 00:32:35,143 We have P of X given Z is okay, P of Z is okay, but we have this P of X our likelihood which has the integral and it's intractable. 295 00:32:36,027 --> 00:32:37,993 So we can't directly optimizes this. 296 00:32:37,993 --> 00:32:45,230 but we'll see that a solution, a solution that will enable us to learn this model 297 00:32:45,230 --> 00:32:54,824 is if in addition to using a decoder network defining this neural network to model P of X given Z. If we now define an additional encoder network 298 00:32:54,824 --> 00:33:06,652 Q of Z given X we're going to call this an encoder because we want to turn our input X into, get the likelihood of Z given X, we're going to encode this into Z. 299 00:33:06,652 --> 00:33:10,329 And defined this network to approximate the P of Z given X. 300 00:33:12,388 --> 00:33:15,688 Right this was posterior density term now is also intractable. 301 00:33:15,688 --> 00:33:22,866 If we use this additional network to approximate this then we'll see that this will actually allow us to derive 302 00:33:22,866 --> 00:33:27,486 a lower bound on the data likelihood that is tractable and which we can optimize. 303 00:33:29,308 --> 00:33:35,396 Okay so first just to be a little bit more concrete about these encoder and decoder networks that I specified, 304 00:33:36,579 --> 00:33:40,695 in variational autoencoders we want the model probabilistic generation of data. 305 00:33:40,695 --> 00:33:51,530 So in autoencoders we already talked about this concept of having an encoder going from input X to some feature Z and a decoder network going from Z back out to some image X. 306 00:33:53,294 --> 00:33:58,907 And so here we go to again have an encoder network and a decoder network but we're going to make these probabilistic. 307 00:33:58,907 --> 00:34:06,134 So now our encoder network Q of Z given X with parameters phi are going to output a mean 308 00:34:06,134 --> 00:34:09,467 and a diagonal covariance and from here, 309 00:34:11,411 --> 00:34:14,795 this will be the direct outputs of our encoder network and the same thing for our 310 00:34:14,795 --> 00:34:23,109 decoder network which is going to start from Z and now it's going to output the mean and the diagonal covariance of some X, 311 00:34:23,109 --> 00:34:26,725 same dimension as the input given Z 312 00:34:26,725 --> 00:34:29,478 And then this decoder network has different parameters theta. 313 00:34:31,136 --> 00:34:42,058 And now in order to actually get our Z and our, This should be Z given X and X given Z. We'll sample from these distributions. 
314 00:34:42,058 --> 00:34:49,072 So now our encoder and our decoder network are producing distributions over Z and X respectively, 315 00:34:49,072 --> 00:34:52,409 and we'll sample from these distributions in order to get values from them. 316 00:34:52,409 --> 00:34:59,630 So you can see how this is taking us in the direction of being able to sample and generate new data. 317 00:34:59,630 --> 00:35:05,041 And just one thing to note is that for these encoder and decoder networks, you'll also hear different terms for them. 318 00:35:05,041 --> 00:35:09,138 The encoder network can also be called a recognition or inference network because 319 00:35:09,138 --> 00:35:15,913 we're trying to perform inference of this latent representation Z given X, and then the decoder 320 00:35:15,913 --> 00:35:18,826 network, this is what we'll use to perform generation. 321 00:35:18,826 --> 00:35:22,993 Right so you'll also hear generation network being used. 322 00:35:24,410 --> 00:35:31,899 Okay so now, equipped with our encoder and decoder networks, let's try and work out the data likelihood again, 323 00:35:31,899 --> 00:35:35,117 and we'll use the log of the data likelihood here. 324 00:35:35,117 --> 00:35:38,833 So we'll see that if we want the log of P of X, right, 325 00:35:38,833 --> 00:35:44,988 we can write this out as log of P of X but take the expectation with respect to Z. 326 00:35:44,988 --> 00:35:51,053 So Z is sampled from our distribution Q of Z given X that we've now defined using the encoder network. 327 00:35:52,606 --> 00:35:58,254 And we can do this because P of X doesn't depend on Z. Right, 'cause Z is not part of that. 328 00:35:58,254 --> 00:36:04,794 And so we'll see that taking the expectation with respect to Z is going to come in handy later on. 329 00:36:06,255 --> 00:36:20,564 Okay so now from this original expression we can expand it out to be log of P of X given Z times P of Z over P of Z given X, using Bayes' rule. And so this is just directly writing this out. 330 00:36:20,564 --> 00:36:24,996 And then taking this we can also now multiply it by a constant. 331 00:36:24,996 --> 00:36:30,874 Right, so Q of Z given X over Q of Z given X. This is one, so we can do this. 332 00:36:30,874 --> 00:36:33,847 It doesn't change anything, but it's going to be helpful later on. 333 00:36:33,847 --> 00:36:39,444 So given that, what we'll do is we'll write it out into these three separate terms. 334 00:36:39,444 --> 00:36:44,703 And you can work out this math later on by yourself, but it's essentially just using logarithm rules, 335 00:36:44,703 --> 00:36:54,728 taking all of these terms that we had in the line above and just separating it out into these three different terms that will have nice meanings. 336 00:36:56,431 --> 00:37:02,754 Right so if we look at this, the first term that we get separated out is the expectation 337 00:37:02,754 --> 00:37:07,210 of log of P of X given Z, and then we're going to have two KL terms, right. 338 00:37:07,210 --> 00:37:14,400 This is basically a KL divergence term that says how close these two distributions are. 339 00:37:14,400 --> 00:37:18,567 So how close is the distribution Q of Z given X to P of Z. 340 00:37:19,489 --> 00:37:24,287 So it's just the, it's exactly this expectation term above. 341 00:37:24,287 --> 00:37:28,454 And it's just a distance metric for distributions. 342 00:37:30,908 --> 00:37:36,183 And so we'll see that, right, we saw that these are nice KL terms that we can write out.
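(For the notes, the decomposition just derived, written out in the usual VAE notation:

$$ \log p_\theta(x) \;=\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big] \;-\; D_{KL}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big) \;+\; D_{KL}\!\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big). $$

The next part of the lecture walks through these three terms one by one.)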
343 00:37:36,183 --> 00:37:39,290 And now if we look at these three terms that we have here, 344 00:37:39,290 --> 00:37:45,819 the first term is P of X given Z, which is provided by our decoder network. 345 00:37:45,819 --> 00:37:52,042 And we're able to compute an estimate of this term through sampling, and we'll see that we can 346 00:37:52,042 --> 00:37:56,099 do sampling that's differentiable through something called the re-parametrization trick, which is a 347 00:37:56,099 --> 00:37:59,920 detail that you can look at in the paper if you're interested. 348 00:37:59,920 --> 00:38:02,479 But basically we can now compute this term. 349 00:38:02,479 --> 00:38:08,600 And then these KL terms: the second KL term is a KL between two Gaussians, 350 00:38:08,600 --> 00:38:16,079 so our Q of Z given X, remember our encoder produced this distribution which had a mean and a covariance, it was a nice Gaussian, 351 00:38:16,079 --> 00:38:19,892 and then also our prior P of Z, which is also a Gaussian. 352 00:38:19,892 --> 00:38:25,628 And so, when you have a KL of two Gaussians, you have a nice closed form solution for this. 353 00:38:25,628 --> 00:38:31,324 And then this third KL term now, this is a KL of Q of Z given X with P of Z given X. 354 00:38:32,303 --> 00:38:36,766 But we know that P of Z given X was this intractable posterior that we saw earlier, right? 355 00:38:36,766 --> 00:38:41,794 That we didn't want to compute, that's why we had this approximation using Q. 356 00:38:41,794 --> 00:38:44,625 And so this term is still a problem. 357 00:38:44,625 --> 00:38:54,776 But one thing we do know about this term is that a KL divergence, a distance between two distributions, is always greater than or equal to zero by definition. 358 00:38:57,060 --> 00:39:03,396 And so what we can do with this is say, well, what we have here, the two terms that we can work nicely with, this is 359 00:39:03,396 --> 00:39:10,023 a tractable lower bound which we can actually take the gradient of and optimize. 360 00:39:10,023 --> 00:39:16,652 P of X given Z is differentiable, and the KL term also, its closed form solution is also differentiable. 361 00:39:16,652 --> 00:39:24,168 And this is a lower bound because we know that the KL term on the right, the ugly one, is greater than or equal to zero. 362 00:39:24,168 --> 00:39:26,251 So we have a lower bound. 363 00:39:27,273 --> 00:39:37,699 And so what we'll do to train a variational autoencoder is that we take this lower bound and we optimize and maximize this lower bound instead. 364 00:39:37,699 --> 00:39:42,251 So we're optimizing a lower bound on the likelihood of our data. 365 00:39:42,251 --> 00:39:49,940 So that means that our data is always going to have a likelihood that's at least as high as this lower bound that we're maximizing. 366 00:39:49,940 --> 00:39:58,941 And so we want to find the parameters, estimate the parameters theta and phi, that allow us to maximize this. 367 00:40:03,169 --> 00:40:06,412 And then one last sort of intuition about this lower bound 368 00:40:06,412 --> 00:40:12,796 that we have is that this first term is an expectation over all samples of Z, 369 00:40:12,796 --> 00:40:22,699 sampled by passing our X through the encoder network and sampling Z, taking the expectation over all of these samples of the likelihood of X given Z, 370 00:40:24,963 --> 00:40:26,854 and so this is a reconstruction, right?
371 00:40:26,854 --> 00:40:33,300 This is basically saying, if I want this to be big, I want this likelihood P of X given Z to be high, 372 00:40:33,300 --> 00:40:37,756 so it's kind of like trying to do a good job reconstructing the data. 373 00:40:37,756 --> 00:40:40,528 So similar to what we had with our autoencoder before. 374 00:40:40,528 --> 00:40:44,695 But the second term here is saying make this KL small. 375 00:40:46,161 --> 00:40:51,283 Make our approximate posterior distribution close to our prior distribution. 376 00:40:51,283 --> 00:41:04,558 And this is basically saying that, well, we want our latent variable Z to follow the distribution shape that we would like it to have. 377 00:41:08,974 --> 00:41:12,058 Okay so any questions about this? 378 00:41:12,058 --> 00:41:19,128 I think this is a lot of math, and if you guys are interested you should go back and kind of work through all of the derivations yourself. 379 00:41:19,128 --> 00:41:19,961 Yeah. 380 00:41:20,883 --> 00:41:23,669 [student's words obscured due to lack of microphone] 381 00:41:23,669 --> 00:41:29,373 So the question is why do we specify the prior and the latent variables as Gaussian? 382 00:41:29,373 --> 00:41:33,512 And the reason is that, well, we're defining some sort of generative process, right, 383 00:41:33,512 --> 00:41:35,930 of sampling Z first and then sampling X. 384 00:41:35,930 --> 00:41:53,307 And defining it as a Gaussian is a reasonable type of prior; we can say it makes sense for these types of latent attributes to be distributed according to some sort of Gaussian, and then this lets us optimize our model. 385 00:41:55,988 --> 00:42:06,053 Okay, so we talked about how we can derive this lower bound, and now let's put this all together and walk through the process of training the VAE. 386 00:42:06,053 --> 00:42:10,008 Right so here's the bound that we want to optimize, to maximize. 387 00:42:10,008 --> 00:42:19,301 And now for a forward pass, we're going to proceed in the following manner. We have our input data X, so we'll take a mini batch of input data. 388 00:42:20,845 --> 00:42:26,544 And then we'll pass it through our encoder network, so we'll get Q of Z given X. 389 00:42:28,439 --> 00:42:35,805 And from this Q of Z given X, these will be the terms that we use to compute the KL term. 390 00:42:35,805 --> 00:42:46,856 And then from here we'll sample Z from this distribution of Z given X, so we have a sample of the latent factors that we can infer from X. 391 00:42:50,721 --> 00:42:54,889 And then from here we're going to pass Z through our second network, the decoder network. 392 00:42:54,889 --> 00:43:07,686 And from the decoder network we'll get this output for the mean and variance of our distribution for X given Z, and then finally we can sample our X given Z from this distribution, 393 00:43:07,686 --> 00:43:12,155 and here this will produce some sample output. 394 00:43:12,155 --> 00:43:23,517 And when we're training, we're going to take this distribution and say, well, our loss term is going to be the log likelihood of our training image pixel values given Z. 395 00:43:23,612 --> 00:43:30,684 So our loss function is going to say let's maximize the likelihood of this original input being reconstructed. 396 00:43:32,020 --> 00:43:35,919 And so now for every mini batch of input we're going to compute this forward pass.
397 00:43:35,919 --> 00:43:43,837 Get all these terms that we need, and then this is all differentiable, so we just backprop through all of this and get our gradient, 398 00:43:43,837 --> 00:43:57,040 we update our model and we use this to continuously update our parameters, our encoder and decoder network parameters phi and theta, in order to maximize the likelihood of the training data. 399 00:43:58,408 --> 00:44:05,547 Okay so once we've trained our VAE, now to generate data, what we can do is use just the decoder network. 400 00:44:05,547 --> 00:44:15,504 All right, so from here we can sample Z now, but instead of sampling Z from the posterior that we had during training, during generation we sample from our true generative process. 401 00:44:15,504 --> 00:44:18,673 So we sample from our prior that we specify. 402 00:44:18,673 --> 00:44:22,840 And then we're going to sample our data X from here. 403 00:44:25,281 --> 00:44:34,798 And we'll see that this can produce, in this case, trained on MNIST, these are samples of digits generated from a VAE trained on MNIST. 404 00:44:36,058 --> 00:44:43,796 And you can see that, you know, we talked about this idea of Z representing these latent factors, where we can 405 00:44:43,796 --> 00:44:52,625 vary Z, right, sampling from different parts of our prior, and then get different kinds of interpretable meanings from here. 406 00:44:52,625 --> 00:44:57,142 So here we can see that this is the data manifold for two dimensional Z. 407 00:44:57,142 --> 00:45:08,568 So if we have a two dimensional Z and we take Z in some range, you know, from different percentiles of the distribution, and we vary Z1 and we vary Z2, 408 00:45:08,568 --> 00:45:16,300 then you can see how the image generated from every combination of Z1 and Z2 that we have here, 409 00:45:16,300 --> 00:45:22,087 you can see it's transitioning smoothly across all of these different variations. 410 00:45:24,051 --> 00:45:27,808 And you know, our prior on Z was diagonal, 411 00:45:27,808 --> 00:45:43,006 so we chose this in order to encourage these to be independent latent variables that can then encode interpretable factors of variation. So because of this, we'll have different dimensions of Z encoding different interpretable factors of variation. 412 00:45:44,477 --> 00:45:54,771 So, in this example, trained now on faces, we'll see as we vary Z1, going up and down, you'll see the amount of smile changing. 413 00:45:54,771 --> 00:46:00,225 So from a frown at the top to like a big smile at the bottom, and then as we vary Z2, 414 00:46:01,997 --> 00:46:07,859 from left to right, you can see the head pose changing, from one direction all the way to the other. 415 00:46:09,883 --> 00:46:18,526 And so one additional thing I want to point out is that as a result of doing this, these Z variables are also good feature representations, 416 00:46:19,510 --> 00:46:26,376 because they encode how much of each of these different interpretable semantics we have. 417 00:46:26,376 --> 00:46:32,296 And so we can use our Q of Z given X, the encoder that we've learned, and give it an input 418 00:46:32,296 --> 00:46:42,249 image X, we can map this to Z and use Z as features for downstream tasks like supervised learning, for example classification, or other tasks. 419 00:46:47,348 --> 00:46:51,434 Okay so just another couple of examples of data generated from VAEs.
420 00:46:51,434 --> 00:47:02,231 So on the left here we have data generated on CIFAR-10, trained on CIFAR-10, and then on the right we have data trained and generated on Faces. 421 00:47:02,231 --> 00:47:08,737 And we'll see so we can see that in general VAEs are able to generate recognizable data. 422 00:47:08,737 --> 00:47:15,493 One of the main drawbacks of VAEs is that they tend to still have a bit of a blurry aspect to them. 423 00:47:15,493 --> 00:47:20,520 You can see this in the faces and so this is still an active area of research. 424 00:47:22,008 --> 00:47:28,030 Okay so to summarize VAEs, they're a probabilistic spin on traditional autoencoders. 425 00:47:28,030 --> 00:47:36,077 So instead of deterministically taking your input X and going to Z, feature Z and then back to reconstructing X, 426 00:47:36,077 --> 00:47:43,023 now we have this idea of distributions and sampling involved which allows us to generate data. 427 00:47:43,023 --> 00:47:51,101 And in order to train this, VAEs are defining an intractable density. So we can derive and optimize a lower bound, 428 00:47:51,101 --> 00:47:59,718 a variational lower bound, so variational means basically using approximations to handle these types of intractable expressions. 429 00:47:59,718 --> 00:48:03,577 And so this is why this is called a variational autoencoder. 430 00:48:03,577 --> 00:48:10,249 And so some of the advantages of this approach is that VAEs are, they're a principled approach 431 00:48:10,249 --> 00:48:17,628 to generative models and they also allow this inference query so being able to infer things like Q of Z given X. 432 00:48:17,628 --> 00:48:21,554 That we said could be useful feature representations for other tasks. 433 00:48:23,101 --> 00:48:29,548 So disadvantages of VAEs are that while we're maximizing the lower bound of the likelihood, which is okay 434 00:48:29,548 --> 00:48:37,782 like you know in general this is still pushing us in the right direction and there's more other theoretical analysis of this. 435 00:48:37,782 --> 00:48:48,378 So you know, it's doing okay, but it's maybe not still as direct an optimization and evaluation as the pixel RNNs and CNNs that we saw earlier, 436 00:48:48,378 --> 00:49:03,348 but which had, and then, also the VAE samples are tending to be a little bit blurrier and of lower quality compared to state of the art samples that we can see from other generative models such as GANs that we'll talk about next. 437 00:49:04,827 --> 00:49:08,647 And so VAEs now are still, they're still an active area of research. 438 00:49:11,044 --> 00:49:13,447 People are working on more flexible approximations, 439 00:49:13,447 --> 00:49:20,881 so richer approximate posteriors, so instead of just a diagonal Gaussian some richer functions for this. 440 00:49:20,881 --> 00:49:26,992 And then also, another area that people have been working on is incorporating more structure in these latent variables. 441 00:49:26,992 --> 00:49:31,282 So now we had all of these independent latent variables 442 00:49:31,282 --> 00:49:38,077 but people are working on having modeling structure in here, groupings, other types of structure. 443 00:49:41,106 --> 00:49:43,106 Okay, so yeah, question. 444 00:49:44,404 --> 00:49:47,529 [student's words obscured due to lack of microphone] 445 00:49:47,529 --> 00:49:51,394 Yeah, so the question is we're deciding the dimensionality of the latent variable. 446 00:49:51,394 --> 00:49:54,727 Yeah, that's something that you specify. 
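(Before moving on to GANs, a minimal code sketch of the VAE forward pass and loss discussed above, for the notes. The fully connected architecture, the dimensions, and the Bernoulli choice for the reconstruction term are illustrative assumptions rather than the lecture's exact setup; the sketch shows the encoder producing a mean and diagonal covariance, the reparameterization trick for differentiable sampling, the closed-form Gaussian KL term, and generation by sampling z from the prior and decoding.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal VAE sketch (sizes and loss details are illustrative assumptions).
class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # mean of q(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log diagonal covariance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # parameters of p(x|z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_logits = self.dec(z)
        return x_logits, mu, logvar

def vae_loss(x, x_logits, mu, logvar):
    # Reconstruction term: log p(x|z), modeled here as a Bernoulli over pixels.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    # KL(q(z|x) || N(0, I)) has a closed form for two Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # negative lower bound; minimizing this maximizes the bound

model = VAE()
x = torch.rand(16, 784)              # a minibatch of (flattened) images in [0, 1]
loss = vae_loss(x, *model(x))
loss.backward()

# Generation after training: sample z from the prior and pass it through the decoder.
with torch.no_grad():
    z = torch.randn(16, 20)
    samples = torch.sigmoid(model.dec(z))
```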
447 00:49:55,874 --> 00:50:07,481 Okay, so we've talked so far about pixelCNNs and VAEs and now we'll take a look at a third and very popular type of generative model called GANs. 448 00:50:10,019 --> 00:50:15,713 So the models that we've seen so far, pixelCNNs and RNNs, define a tractable density function. 449 00:50:15,713 --> 00:50:19,752 And they optimize the likelihood of the training data. 450 00:50:19,752 --> 00:50:27,752 And then VAEs in contrast to that now have this additional latent variable Z that they define in the generative process. 451 00:50:27,752 --> 00:50:36,858 And so having the Z has a lot of nice properties that we talked about, but it also causes us to have this intractable density function that we can't 452 00:50:36,858 --> 00:50:43,934 optimize directly, and so we derive and optimize a lower bound on the likelihood instead. 453 00:50:43,934 --> 00:50:48,486 And so now what if we just give up on explicitly modeling this density at all? 454 00:50:48,486 --> 00:50:55,267 And we say well what we want is just the ability to sample and to have nice samples from our distribution. 455 00:50:56,501 --> 00:50:59,175 So this is the approach that GANs take. 456 00:50:59,175 --> 00:51:02,637 So in GANs we don't work with an explicit density function, 457 00:51:02,637 --> 00:51:05,642 but instead we're going to take a game-theoretic approach 458 00:51:05,642 --> 00:51:13,839 and we're going to learn to generate from our training distribution through a set up of a two player game, and we'll talk about this in more detail. 459 00:51:15,255 --> 00:51:24,681 So, in the GAN set up we're saying, okay, what we care about is we want to be able to sample from a complex high dimensional training distribution. 460 00:51:24,681 --> 00:51:31,170 So if we think about wanting to produce samples from this distribution, there's no direct way that we can do this. 461 00:51:31,170 --> 00:51:35,078 We have this very complex distribution, we can't just take samples from here. 462 00:51:35,078 --> 00:51:46,875 So the solution that we're going to take is that we can, however, sample from simpler distributions, for example random noise. Gaussians, these we can sample from. 463 00:51:46,875 --> 00:51:56,789 And so what we're going to do is we're going to learn a transformation from these simple distributions directly to the training distribution that we want. 464 00:51:58,790 --> 00:52:04,304 So the question is, what can we use to represent this complex transformation? 465 00:52:06,120 --> 00:52:07,718 Neural network, I heard the answer. 466 00:52:07,718 --> 00:52:14,373 So when we want to model some kind of complex function or transformation we use a neural network. 467 00:52:14,373 --> 00:52:23,297 Okay so what we're going to do in the GAN set up is we're going to take some input which is a vector of random noise of some dimension that we specify, 468 00:52:23,297 --> 00:52:33,628 and then we're going to pass this through a generator network, and then we're going to get as output directly a sample from the training distribution. 469 00:52:33,628 --> 00:52:40,154 So we want every input of random noise to correspond to a sample from the training distribution. 470 00:52:41,278 --> 00:52:48,737 And so the way we're going to train and learn this network is that we're going to look at this as a two player game. 471 00:52:48,737 --> 00:52:54,595 So we have two players, a generator network as well as an additional discriminator network that I'll show next.
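(A minimal sketch of that generator idea, with hypothetical dimensions and layers; the point is just that a neural network maps a random noise vector directly to an image-sized sample.)

    import torch
    import torch.nn as nn

    noise_dim, img_dim = 96, 28 * 28             # hypothetical dimensions

    # A small fully-connected generator: noise vector in, image-sized sample out.
    generator = nn.Sequential(
        nn.Linear(noise_dim, 256), nn.ReLU(),
        nn.Linear(256, img_dim), nn.Tanh(),      # outputs scaled to [-1, 1]
    )

    z = torch.rand(64, noise_dim) * 2 - 1        # a minibatch of random noise vectors
    fake_images = generator(z)                   # one generated sample per noise vector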
472 00:52:54,595 --> 00:53:04,320 And our generator network, as player one, is going to try to fool the discriminator by generating real looking images. 473 00:53:04,320 --> 00:53:12,462 And then our second player, our discriminator network, is then going to try to distinguish between real and fake images. 474 00:53:12,462 --> 00:53:23,323 So it wants to do as good a job as possible of trying to determine which of these images are counterfeit or fake images generated by this generator. 475 00:53:25,425 --> 00:53:27,324 Okay so what this looks like is, 476 00:53:27,324 --> 00:53:31,203 we have our random noise going into our generator network, 477 00:53:31,203 --> 00:53:36,121 the generator network is generating these images that we're going to call fake images. 478 00:53:36,121 --> 00:53:42,439 And then we're going to also have real images that we take from our training set and then we want the 479 00:53:42,439 --> 00:53:50,881 discriminator to be able to distinguish between real and fake images. 480 00:53:50,881 --> 00:53:52,849 It outputs real or fake for each image. 481 00:53:52,849 --> 00:54:01,638 So the idea is we want to train a good discriminator, and if it can do a good job of discriminating real versus fake, 482 00:54:01,638 --> 00:54:11,140 and if our generator network is able to do well and generate fake images that can successfully fool this discriminator, 483 00:54:11,140 --> 00:54:13,135 then we have a good generative model. 484 00:54:13,135 --> 00:54:17,431 We're generating images that look like images from the training set. 485 00:54:19,482 --> 00:54:25,548 Okay, so we have these two players and so we're going to train this jointly in a minimax game formulation. 486 00:54:25,548 --> 00:54:28,941 So this minimax objective function is what we have here. 487 00:54:28,941 --> 00:54:37,399 It's going to be a minimum over theta G, the parameters of our generator network G, 488 00:54:37,399 --> 00:54:44,848 and a maximum over theta D, the parameters of our discriminator network D, of this objective, right, these terms. 489 00:54:47,177 --> 00:54:49,624 And so if we look at these terms, what this is saying 490 00:54:49,624 --> 00:54:54,910 is, well, this first thing, expectation over the data of log of D of X. 491 00:54:56,094 --> 00:55:01,151 This log of D of X is the discriminator output for real data X. 492 00:55:01,151 --> 00:55:09,309 This is going to be the likelihood of real data being real, for data from the data distribution P data. 493 00:55:09,309 --> 00:55:16,882 And then the second term here, expectation of Z drawn from P of Z, so Z drawn from P of Z means samples of noise from our prior that go through 494 00:55:16,882 --> 00:55:27,577 our generator network, and this term D of G of Z that we have here is the output of our discriminator for the generated fake data, 495 00:55:29,109 --> 00:55:33,769 what the discriminator outputs for G of Z, which is our fake data. 496 00:55:36,311 --> 00:55:43,105 And so if we think about what this is trying to do, our discriminator wants to maximize this objective, right, 497 00:55:43,105 --> 00:55:53,278 it's a max over theta D such that D of X is close to one. It's close to real, it's high for the real data. 498 00:55:53,278 --> 00:56:02,679 And then D of G of Z, what it thinks of the fake data, is small, we want this to be close to zero.
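(Written out in symbols, the minimax objective being described is the standard GAN objective:)

    \min_{\theta_g} \max_{\theta_d} \;
      \Big[\; \mathbb{E}_{x \sim p_{\text{data}}} \big[ \log D_{\theta_d}(x) \big]
      \;+\; \mathbb{E}_{z \sim p(z)} \big[ \log \big( 1 - D_{\theta_d}(G_{\theta_g}(z)) \big) \big] \;\Big]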
499 00:56:02,679 --> 00:56:09,237 So if we're able to maximize this, it means the discriminator is doing a good job of distinguishing between real and fake. 500 00:56:09,237 --> 00:56:13,449 Basically classifying between real and fake data. 501 00:56:13,449 --> 00:56:22,375 And then our generator, here we want the generator to minimize this objective such that D of G of Z is close to one. 502 00:56:22,375 --> 00:56:35,236 So if this D of G of Z is close to one over here, then one minus D of G of Z is small, and so if we minimize this term, then we're having the 503 00:56:36,768 --> 00:56:39,175 discriminator think that our fake data's actually real. 504 00:56:39,175 --> 00:56:44,087 So that means that our generator is producing real-looking samples. 505 00:56:44,087 --> 00:56:51,139 Okay so this is the important objective of GANs to try and understand, so are there any questions about this? 506 00:56:51,139 --> 00:57:01,360 [student's words obscured due to lack of microphone] I'm not sure I understand your question, can you, [student's words obscured due to lack of microphone] 507 00:57:12,334 --> 00:57:23,067 Yeah, so the question is whether this is basically trying to have the first network produce real looking images that our second network, the discriminator, cannot distinguish from real ones. 508 00:57:30,474 --> 00:57:36,809 Okay, so the question is how do we actually label the data or do the training for these networks. 509 00:57:36,809 --> 00:57:46,180 We'll see how to train the networks next. But in terms of what the data label is, basically, this is unsupervised, so there's no data labeling. 510 00:57:46,180 --> 00:57:52,805 But data generated from the generator network, the fake images, have a label of basically zero or fake. 511 00:57:52,805 --> 00:58:00,344 And we can take training images that are real images and these basically have a label of one or real. 512 00:58:00,344 --> 00:58:04,866 So the loss function for our discriminator is using this. 513 00:58:04,866 --> 00:58:09,819 It's trying to output a zero for the generator images and a one for the real images. 514 00:58:09,819 --> 00:58:12,048 So there's no external labels. 515 00:58:12,048 --> 00:58:15,136 [student's words obscured due to lack of microphone] 516 00:58:15,136 --> 00:58:22,119 So the question is whether the label for the generator network will be the output of the discriminator network. 517 00:58:22,119 --> 00:58:29,321 The generator is not really doing classification. 518 00:58:29,321 --> 00:58:35,536 Its objective is here, D of G of Z, it wants this to be high. 519 00:58:35,536 --> 00:58:42,487 So given a fixed discriminator, it wants to learn the generator parameters such that this is high. 520 00:58:42,487 --> 00:58:47,752 So we'll take the fixed discriminator output and use that to do the backprop. 521 00:58:51,447 --> 00:58:54,219 Okay so in order to train this, what we're going to do 522 00:58:54,219 --> 00:58:57,714 is we're going to alternate between gradient ascent 523 00:58:57,714 --> 00:59:05,222 on our discriminator, so we're trying to learn theta D to maximize this objective. 524 00:59:05,222 --> 00:59:08,059 And then gradient descent on the generator. 525 00:59:08,059 --> 00:59:15,698 So taking gradient descent on these parameters theta G such that we're minimizing this objective. 526 00:59:15,698 --> 00:59:23,748 And here we are only taking this right part over here because that's the only part that's dependent on the theta G parameters.
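(As a rough sketch of that labeling point, with a hypothetical discriminator and random stand-in images: binary cross-entropy with label 1 for real and 0 for fake is one standard way to implement the two log terms of the discriminator's objective.)

    import torch
    import torch.nn as nn

    # Hypothetical discriminator: image in, probability of "real" out.
    discriminator = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
                                  nn.Linear(256, 1), nn.Sigmoid())
    bce = nn.BCELoss()

    real_images = torch.rand(64, 28 * 28)        # stand-in for a minibatch from the training set
    fake_images = torch.rand(64, 28 * 28)        # stand-in for a minibatch from the generator

    real_labels = torch.ones(64, 1)              # "real" = 1
    fake_labels = torch.zeros(64, 1)             # "fake" = 0

    # Minimizing this is the same as maximizing log D(x) + log(1 - D(G(z))).
    d_loss = (bce(discriminator(real_images), real_labels)
              + bce(discriminator(fake_images), fake_labels))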
527 00:59:26,574 --> 00:59:30,603 Okay so this is how we can train this GAN. 528 00:59:30,603 --> 00:59:35,716 We can alternate between training our discriminator and our generator in this game, 529 00:59:35,716 --> 00:59:40,561 with the generator trying to fool the discriminator. 530 00:59:40,561 --> 00:59:50,478 But one thing that is important to note is that in practice this generator objective as we've just defined it actually doesn't work that well. 531 00:59:50,478 --> 00:59:55,309 And to see the reason for this we have to look at the loss landscape. 532 00:59:55,309 --> 01:00:01,059 So if we look at the loss landscape over here as a function of D of G of Z, 533 01:00:02,858 --> 01:00:10,654 if we plot here log of one minus D of G of Z, which is what we want to minimize for the generator, it has this shape here. 534 01:00:12,748 --> 01:00:21,119 So we want to minimize this and it turns out the slope of this loss is actually going to be higher towards the right. 535 01:00:21,119 --> 01:00:24,369 High when D of G of Z is closer to one. 536 01:00:26,915 --> 01:00:36,837 So that means that when our generator is doing a good job of fooling the discriminator, we're going to have a high gradient. 537 01:00:36,837 --> 01:00:44,794 And on the other hand when we have bad samples, when our generator hasn't learned to generate well yet, 538 01:00:44,794 --> 01:00:52,159 then this is when the discriminator can easily tell them apart, and we're closer to this zero region on the X axis. 539 01:00:53,002 --> 01:00:55,482 And here the gradient's relatively flat. 540 01:00:55,482 --> 01:01:03,977 And so what this actually means is that our gradient signal is dominated by the region where the samples are already pretty good. 541 01:01:05,200 --> 01:01:12,624 Whereas we actually want it to learn a lot when the samples are bad, right? These are the samples that we want to learn from. 542 01:01:12,624 --> 01:01:21,664 And so this basically makes it hard to learn, and so in order to improve learning, 543 01:01:21,664 --> 01:01:26,320 what we're going to do is define a slightly different objective function for the generator. 544 01:01:26,320 --> 01:01:30,145 Where now we're going to do gradient ascent instead. 545 01:01:30,145 --> 01:01:35,748 And so instead of minimizing the likelihood of our discriminator being correct, which is what we had earlier, 546 01:01:35,748 --> 01:01:40,908 now we'll kind of flip it and say let's maximize the likelihood of our discriminator being wrong. 547 01:01:40,908 --> 01:01:49,720 And so this will produce this objective here of maximizing log of D of G of Z. 548 01:01:50,767 --> 01:01:55,102 And so, now basically, there should be a negative sign here. 549 01:01:59,160 --> 01:02:08,659 But basically we want to now maximize this flipped objective instead, and what this does is, if we plot this function 550 01:02:10,118 --> 01:02:16,149 on the right here, then we have a high gradient signal in this region on the left where we have bad samples, 551 01:02:16,149 --> 01:02:23,242 and now the flatter region is to the right where we would have good samples. 552 01:02:23,242 --> 01:02:26,571 So now we're going to learn more from regions of bad samples.
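(A tiny sketch of the two generator losses, using made-up D(G(z)) scores, just to show the sign flip being described:)

    import torch

    d_fake = torch.tensor([0.01, 0.50, 0.99])    # made-up D(G(z)) scores for three samples

    saturating_loss = torch.log(1 - d_fake)      # what the original generator objective minimizes
    non_saturating_loss = -torch.log(d_fake)     # what the flipped objective minimizes

    # Near D(G(z)) = 0 (bad samples) the first loss is relatively flat, so the gradient is weak,
    # while the second loss is large and steep there, giving a strong learning signal.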
553 01:02:26,571 --> 01:02:35,990 And so this has the same objective of fooling the discriminator, but it actually works much better in practice, and a lot of work on GANs that 554 01:02:35,990 --> 01:02:41,492 uses this kind of vanilla GAN formulation is actually using this objective. 555 01:02:44,220 --> 01:02:59,079 Okay so just an aside on that is that jointly training these two networks is challenging and can be unstable. So as we saw here, we're alternating between training a discriminator and training a generator. 556 01:02:59,079 --> 01:03:08,398 This type of alternation, basically it's hard to learn two networks at once, and there's also this issue 557 01:03:08,398 --> 01:03:13,815 that depending on what our loss landscape looks like, it can affect our training dynamics. 558 01:03:13,815 --> 01:03:23,342 So an active area of research still is how can we choose objectives with better loss landscapes that can help training and make it more stable? 559 01:03:26,516 --> 01:03:31,152 Okay so now let's put this all together and look at the full GAN training algorithm. 560 01:03:31,152 --> 01:03:34,366 So what we're going to do is for each iteration of training 561 01:03:34,366 --> 01:03:41,078 we're going to first train the discriminator network a bit and then train the generator network. 562 01:03:41,078 --> 01:03:43,959 So for k steps of training the discriminator network, 563 01:03:43,959 --> 01:03:55,859 we'll sample a mini batch of noise samples from our noise prior over Z and then also sample a mini batch of real samples from our training data X. 564 01:03:57,366 --> 01:04:04,519 So what we'll do is we'll pass the noise through our generator, and we'll get our fake images out. 565 01:04:04,519 --> 01:04:08,052 So we have a mini batch of fake images and a mini batch of real images. 566 01:04:08,052 --> 01:04:15,041 And then we'll take a gradient step on the discriminator using this mini batch, our fake and our real images, 567 01:04:15,041 --> 01:04:17,891 and then update our discriminator parameters. 568 01:04:17,891 --> 01:04:24,313 And we'll do this a certain number of iterations to train the discriminator for a bit basically. 569 01:04:24,313 --> 01:04:28,803 And then after that we'll go to our second step which is training the generator. 570 01:04:28,803 --> 01:04:32,544 And so here we'll sample just a mini batch of noise samples. 571 01:04:32,544 --> 01:04:43,102 We'll pass this through our generator and then now we want to do backprop on this to basically optimize our generator objective that we saw earlier. 572 01:04:45,078 --> 01:04:49,705 So we want to have our generator fool our discriminator as much as possible. 573 01:04:50,773 --> 01:04:58,895 And so we're going to alternate between these two steps of taking gradient steps for our discriminator and for the generator. 574 01:04:59,996 --> 01:05:07,709 And I said for k steps up here for training the discriminator, and this is kind of a topic of debate. 575 01:05:08,604 --> 01:05:15,391 Some people think just having one step of the discriminator, then one step of the generator, is best. 576 01:05:15,391 --> 01:05:20,744 Some people think it's better to train the discriminator for a little bit longer before switching to the generator. 577 01:05:20,744 --> 01:05:30,732 There's no real clear rule, and people have found different things to work better depending on the problem.
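(Putting those steps together, here's a minimal sketch of the alternating training loop. The networks, optimizers, hyperparameters, and the random stand-in for real data are all hypothetical, and the generator step uses the non-saturating objective from above rather than the original one.)

    import torch
    import torch.nn as nn

    noise_dim, img_dim, k = 96, 28 * 28, 1       # hypothetical sizes; k = discriminator steps
    G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                      nn.Linear(256, img_dim), nn.Tanh())
    D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                      nn.Linear(256, 1), nn.Sigmoid())
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def real_minibatch():                        # stand-in for sampling real training images
        return torch.rand(64, img_dim) * 2 - 1

    for iteration in range(1000):
        # Step 1: k gradient steps on the discriminator.
        for _ in range(k):
            z = torch.rand(64, noise_dim) * 2 - 1
            fake = G(z).detach()                 # don't backprop into G on this step
            d_loss = (bce(D(real_minibatch()), torch.ones(64, 1))
                      + bce(D(fake), torch.zeros(64, 1)))
            opt_D.zero_grad()
            d_loss.backward()
            opt_D.step()

        # Step 2: one gradient step on the generator (non-saturating objective).
        z = torch.rand(64, noise_dim) * 2 - 1
        g_loss = bce(D(G(z)), torch.ones(64, 1))   # i.e. maximize log D(G(z))
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()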
578 01:05:30,732 --> 01:05:45,028 And one thing I want to point out is that there's been a lot of recent work that alleviates this problem and makes it so you don't have to spend so much effort trying to balance the training of these two networks. 579 01:05:45,028 --> 01:05:47,880 It'll have more stable training and give better results. 580 01:05:47,880 --> 01:05:55,655 And so Wasserstein GAN is an example of a paper that was an important work towards doing this. 581 01:06:00,313 --> 01:06:09,767 Okay so looking at the whole picture, we have our network setup, we've trained both our generator network and our discriminator network, 582 01:06:09,767 --> 01:06:16,899 and now after training, for generation we can just take our generator network and use this to generate new images. 583 01:06:16,899 --> 01:06:21,520 So we just take noise Z and pass this through and generate fake images from here. 584 01:06:23,636 --> 01:06:28,351 Okay and so now let's look at some generated samples from these GANs. 585 01:06:28,351 --> 01:06:33,099 So here's an example trained on MNIST and then on the right one trained on Faces. 586 01:06:33,099 --> 01:06:43,849 And for each of these you can also see, just for visualization, the nearest neighbor from the training set shown next to the column of generated samples right beside it. 587 01:06:43,849 --> 01:06:49,227 And so you can see that we're able to generate very realistic samples, and the model isn't directly memorizing the training set. 588 01:06:51,264 --> 01:06:56,061 And here are some examples from the original GAN paper on CIFAR images. 589 01:06:56,061 --> 01:07:07,374 And these are not such good quality yet; the original work is from 2014, so these are some older, simpler networks. 590 01:07:07,374 --> 01:07:11,541 And these were using simple, fully connected networks. 591 01:07:12,550 --> 01:07:16,018 And so since that time there's been a lot of work on improving GANs. 592 01:07:18,120 --> 01:07:31,388 One example of a work that really took a big step towards improving the quality of samples is this work from Alec Radford in ICLR 2016 on adding convolutional architectures to GANs. 593 01:07:33,806 --> 01:07:42,958 In this paper there was a whole set of guidelines on architectures for helping GANs to produce better samples. 594 01:07:42,958 --> 01:07:46,517 So you can look at this for more details. 595 01:07:46,517 --> 01:07:52,669 This is an example of a convolutional architecture that they're using, which is going from our input 596 01:07:52,669 --> 01:07:57,694 noise vector Z and transforming this all the way to the output sample. 597 01:08:00,527 --> 01:08:08,251 So now from this large convolutional architecture we'll see that the samples from this model are really starting to look very good. 598 01:08:08,251 --> 01:08:11,408 So this is trained on a dataset of bedrooms 599 01:08:11,408 --> 01:08:15,575 and we can see all kinds of very realistic, fancy looking 600 01:08:16,783 --> 01:08:26,063 bedrooms with windows and night stands and other furniture around, so these are some really pretty samples. 601 01:08:26,064 --> 01:08:32,346 And we can also try and interpret a little bit of what these GANs are doing. 602 01:08:32,346 --> 01:08:42,817 So in this example here what we can do is we can take two points of Z, two different random noise vectors, and just interpolate between these points.
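(A minimal sketch of that interpolation experiment, with a hypothetical untrained generator standing in for a trained one: linearly interpolate between two noise vectors and generate an image at each point along the path.)

    import torch
    import torch.nn as nn

    noise_dim, img_dim = 96, 28 * 28
    G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                      nn.Linear(256, img_dim), nn.Tanh())   # stand-in for a trained generator

    z_start = torch.randn(noise_dim)             # first random noise vector
    z_end = torch.randn(noise_dim)               # second random noise vector

    # Walk from z_start to z_end in 10 steps and generate an image at each point.
    alphas = torch.linspace(0, 1, 10).unsqueeze(1)
    z_path = (1 - alphas) * z_start + alphas * z_end        # shape (10, noise_dim)
    row_of_images = G(z_path)                    # with a trained G these vary smoothly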
603 01:08:42,818 --> 01:08:50,142 And each row across here is an interpolation from one random noise vector Z to another random noise vector Z, 604 01:08:50,142 --> 01:08:57,072 and you can see that as Z changes, the image is smoothly interpolating as well, all the way across. 605 01:08:59,286 --> 01:09:02,067 And so something else that we can do is, 606 01:09:02,067 --> 01:09:10,313 well, let's try to analyze further what these vectors Z mean, and so we can try and do vector math here. 607 01:09:10,313 --> 01:09:17,828 So what this experiment does is it says okay, let's take some samples of 608 01:09:17,828 --> 01:09:26,628 smiling women images, and then let's take some samples of neutral women and then also some samples of neutral men. 609 01:09:28,341 --> 01:09:34,920 And so let's take the average of the Z vectors that produced each of these samples. 610 01:09:34,920 --> 01:09:45,037 Say we take the mean vector for the smiling women, subtract the mean vector for the neutral women and add the mean vector for the neutral men, what do we get? 611 01:09:46,651 --> 01:09:49,884 And we get samples of smiling men. 612 01:09:49,884 --> 01:09:56,200 So we can take the Z vector produced there, generate samples, and get samples of smiling men. 613 01:09:57,190 --> 01:10:03,879 And we can have another example of this. A man with glasses, minus a man without glasses, plus a woman without glasses. 614 01:10:05,918 --> 01:10:08,763 And we get a woman with glasses. 615 01:10:08,763 --> 01:10:18,358 So here you can see that basically the Z vectors have this type of interpretability, and you can use this to generate some pretty cool examples. 616 01:10:20,026 --> 01:10:23,967 Okay so this year, 2017, has really been the year of the GAN. 617 01:10:24,842 --> 01:10:33,261 There's been tons and tons of work on GANs and it's really sort of exploded and gotten some really cool results. 618 01:10:33,261 --> 01:10:38,680 So on the left here you can see people working on better training and generation. 619 01:10:38,680 --> 01:10:45,621 So we talked about improving the loss functions, more stable training, and this was able to get really nice 620 01:10:47,216 --> 01:10:50,173 generations here from different types of architectures, 621 01:10:50,173 --> 01:10:54,326 on the bottom here really crisp high resolution faces. 622 01:10:54,326 --> 01:11:01,742 With GANs there's also been models for source to target domain transfer and conditional GANs. 623 01:11:01,742 --> 01:11:08,363 And so here, this is an example of source to target domain transfer where, for example in the upper part 624 01:11:08,363 --> 01:11:14,703 here, we are trying to go from a source domain of horses to an output domain of zebras. 625 01:11:14,703 --> 01:11:25,813 So we can take an image of horses and train a GAN such that the output is going to be the same thing but now with zebras in the same image setting as the horses, 626 01:11:28,408 --> 01:11:33,124 and go the other way around. We can transform apples into oranges. 627 01:11:33,124 --> 01:11:38,608 And also the other way around. We can also use this to do photo enhancement. 628 01:11:38,608 --> 01:11:52,379 So taking a standard photo and trying to make it look as if you had taken it with a really nice expensive camera, so that you can get the nice blur effects.
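(And going back to the vector-arithmetic experiment from a moment ago, a minimal sketch of the idea, again with hypothetical stand-ins: average the z vectors behind each group of samples, do the arithmetic in latent space, and decode the result.)

    import torch
    import torch.nn as nn

    noise_dim, img_dim = 96, 64 * 64
    G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                      nn.Linear(256, img_dim), nn.Tanh())   # stand-in for a trained generator

    # Suppose we've picked a few z vectors whose samples fall into each group.
    z_smiling_woman = torch.randn(3, noise_dim)
    z_neutral_woman = torch.randn(3, noise_dim)
    z_neutral_man = torch.randn(3, noise_dim)

    # Average each group, do the arithmetic in latent space, then decode the result.
    z_new = (z_smiling_woman.mean(dim=0)
             - z_neutral_woman.mean(dim=0)
             + z_neutral_man.mean(dim=0))
    smiling_man = G(z_new)                       # with a trained G this tends to be a smiling man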
629 01:11:52,379 --> 01:12:03,750 On the bottom here we have scene changing, so transforming an image of Yosemite from the image in winter time to the image in summer time. 630 01:12:03,750 --> 01:12:05,753 And there's really tons of applications. 631 01:12:05,753 --> 01:12:16,373 So on the right here there's more. There's also going from a text description, having a GAN that's now conditioned on this text description, and producing an image. 632 01:12:18,343 --> 01:12:26,421 So there's something here about a small bird with a pink breast and crown and now we're going to generate images of this. 633 01:12:26,421 --> 01:12:37,383 And there's also examples down here of filling in edges. So conditioned on some sketch that we have, can we fill in a color version of what this would look like. 634 01:12:40,848 --> 01:12:50,416 Can we take a map grid, like Google Maps, and turn it into something that looks like Google Earth, 635 01:12:52,528 --> 01:12:56,767 going in and hallucinating all of these buildings and trees and so on. 636 01:12:56,767 --> 01:13:07,061 And so there's lots of really cool examples of this. And there's also this website for pix2pix which did a lot of these kinds of conditional GAN type examples. 637 01:13:08,077 --> 01:13:17,549 I encourage you to go look at it for more interesting applications that people have done with GANs. 638 01:13:17,549 --> 01:13:24,640 And in terms of research papers, there's also a huge number of papers about GANs this year now. 639 01:13:26,047 --> 01:13:31,365 There's a website called the GAN Zoo that is trying to compile a whole list of these. 640 01:13:31,365 --> 01:13:44,794 And here this has only taken me from A through C on the left and through about L on the right, so it won't even fit on the slide. There's tons of papers as well that you can look at if you're interested. 641 01:13:44,794 --> 01:13:57,376 And then one last pointer is also for tips and tricks for training GANs, here's a nice little website that has pointers if you're trying to train these GANs in practice. 642 01:14:01,313 --> 01:14:06,915 Okay, so summary of GANs. GANs don't work with an explicit density function. 643 01:14:06,915 --> 01:14:13,989 Instead we're going to represent this implicitly through samples, and they take a game-theoretic approach to training, 644 01:14:13,989 --> 01:14:18,973 so we're going to learn to generate from our training distribution through a two player game setup. 645 01:14:18,973 --> 01:14:26,212 And the pros of GANs are that they produce really gorgeous state of the art samples and you can do a lot with these. 646 01:14:26,212 --> 01:14:33,247 The cons are that they are trickier and more unstable to train, we're not just directly optimizing 647 01:14:36,499 --> 01:14:41,830 one objective function that we can just do backprop on and train easily. 648 01:14:41,830 --> 01:14:47,710 Instead we have these two networks that we're trying to balance the training of, so it can be a bit more unstable. 649 01:14:47,710 --> 01:14:57,629 And we also lose out on being able to do some of the inference queries, P of X, P of Z given X, that we had for example in our VAE. 650 01:14:57,629 --> 01:15:07,040 And GANs are still an active area of research, this is a relatively new type of model that we're starting to see a lot of and you'll be seeing a lot more of.
651 01:15:07,040 --> 01:15:20,633 And so people are still working now on better loss functions and more stable training, so Wasserstein GAN, for those of you who are interested, is basically an improvement in this direction 652 01:15:22,224 --> 01:15:31,489 that a lot of people are now using and basing models off of. There's also other work like LSGAN, Least Squares GAN, and others. 653 01:15:31,489 --> 01:15:39,307 So you can look into this more. And a lot of times for these new models, in terms of actually implementing them, they're not necessarily big changes. 654 01:15:39,307 --> 01:15:44,279 They're different loss functions that you can change a little bit and get a big improvement in training. 655 01:15:44,279 --> 01:15:51,500 And so some of these are worth looking into, and you'll also get some practice on your homework assignment. 656 01:15:51,500 --> 01:15:59,946 And there's also a lot of work on different types of conditional GANs and GANs for all kinds of different problem setups and applications. 657 01:16:01,648 --> 01:16:05,807 Okay so a recap of today. We talked about generative models. 658 01:16:05,807 --> 01:16:12,329 We talked about three of the most common kinds of generative models that people are using and doing research on today. 659 01:16:12,329 --> 01:16:17,588 So we talked first about pixelRNN and pixelCNN, which are explicit density models. 660 01:16:17,588 --> 01:16:26,981 They optimize the exact likelihood and they produce good samples, but they're pretty inefficient because of the sequential generation. 661 01:16:26,981 --> 01:16:35,090 We looked at VAEs, which optimize a variational lower bound on the likelihood, and this also produces a useful latent representation. 662 01:16:35,090 --> 01:16:40,305 You can do inference queries. But the sample quality is still not the best. 663 01:16:40,305 --> 01:16:47,657 So even though it has a lot of promise, it's still a very active area of research and has a lot of open problems. 664 01:16:47,657 --> 01:16:57,375 And then GANs, which we talked about, take a game-theoretic approach to training, and they currently achieve the best state of the art samples. 665 01:16:57,375 --> 01:17:05,047 But they can also be tricky and unstable to train, and they lose out a bit on the inference queries. 666 01:17:05,047 --> 01:17:10,239 And so what you'll also see is a lot of recent work on combinations of these kinds of models. 667 01:17:10,239 --> 01:17:12,733 So for example adversarial autoencoders. 668 01:17:12,733 --> 01:17:18,478 Something like a VAE trained with an additional adversarial loss on top, which improves the sample quality. 669 01:17:18,478 --> 01:17:32,444 There's also things like pixelVAE, which is a combination of pixelCNN and VAE, so there's a lot of combinations basically trying to take the best of all these worlds and put them together. 670 01:17:32,444 --> 01:17:40,449 Okay so today we talked about generative models and next time we'll talk about reinforcement learning. Thanks.